Multi-band MelGAN: Faster Waveform Generation for High-Quality Text-to-Speech

Yang, Geng; Yang, Shan; Liu, Kai; Fang, Peng; Chen, Wei; Xie, Lei


arXiv:2005.05106 (cs)

[Submitted on 11 May 2020 (v1), last revised 17 Nov 2020 (this version, v2)]



Abstract: In this paper, we propose multi-band MelGAN, a much faster waveform generation model targeting high-quality text-to-speech. Specifically, we improve the original MelGAN in the following aspects. First, we increase the receptive field of the generator, which has been shown to benefit speech generation. Second, we replace the feature matching loss with the multi-resolution STFT loss to better measure the difference between fake and real speech. Together with pre-training, this improvement leads to both better quality and better training stability. More importantly, we extend MelGAN with multi-band processing: the generator takes mel-spectrograms as input and produces sub-band signals, which are subsequently summed back to full-band signals as discriminator input. The proposed multi-band MelGAN achieves high MOS of 4.34 and 4.22 in waveform generation and TTS, respectively. With only 1.91M parameters, our model effectively reduces the total computational complexity of the original MelGAN from 5.85 to 0.95 GFLOPS. Our PyTorch implementation, which will be open-sourced shortly, can achieve a real-time factor of 0.03 on CPU without hardware-specific optimization.
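The abstract names a multi-resolution STFT loss but does not spell out its terms. The sketch below is one common formulation, assumed here rather than taken from this listing: a spectral-convergence term plus a log-magnitude L1 term, averaged over several STFT resolutions. The function names, the toy FFT and hop sizes, and the naive pure-Python DFT are all illustrative choices, not the authors' PyTorch implementation.

```python
import cmath
import math


def stft_magnitude(signal, fft_size, hop_size):
    """Magnitude spectrogram from a naive DFT with a Hann window (slow, for illustration)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / fft_size) for n in range(fft_size)]
    frames = []
    for start in range(0, len(signal) - fft_size + 1, hop_size):
        frame = [signal[start + n] * window[n] for n in range(fft_size)]
        # Keep only the non-negative frequency bins of a real-valued signal.
        frames.append([
            abs(sum(frame[n] * cmath.exp(-2j * math.pi * k * n / fft_size)
                    for n in range(fft_size)))
            for k in range(fft_size // 2 + 1)
        ])
    if not frames:
        raise ValueError("signal is shorter than fft_size")
    return frames


def stft_loss(real, fake, fft_size, hop_size, eps=1e-7):
    """Single-resolution loss: spectral convergence + mean log-magnitude L1 (assumed form)."""
    flat_r = [v for frame in stft_magnitude(real, fft_size, hop_size) for v in frame]
    flat_f = [v for frame in stft_magnitude(fake, fft_size, hop_size) for v in frame]
    # Spectral convergence: Frobenius norm of the difference over that of the reference.
    diff_norm = math.sqrt(sum((r - f) ** 2 for r, f in zip(flat_r, flat_f)))
    ref_norm = math.sqrt(sum(r ** 2 for r in flat_r)) + eps
    spectral_convergence = diff_norm / ref_norm
    # Mean absolute difference of log magnitudes.
    log_magnitude = sum(
        abs(math.log(r + eps) - math.log(f + eps)) for r, f in zip(flat_r, flat_f)
    ) / len(flat_r)
    return spectral_convergence + log_magnitude


def multi_resolution_stft_loss(real, fake, resolutions=((64, 16), (128, 32), (32, 8))):
    """Average the single-resolution loss over (fft_size, hop_size) pairs.

    The default resolutions are hypothetical toy values chosen to keep the demo fast.
    """
    return sum(stft_loss(real, fake, n, h) for n, h in resolutions) / len(resolutions)


if __name__ == "__main__":
    real = [math.sin(2 * math.pi * 5 * n / 256) for n in range(256)]
    fake = [math.sin(2 * math.pi * 9 * n / 256) for n in range(256)]
    print(multi_resolution_stft_loss(real, real))  # identical signals give zero loss
    print(multi_resolution_stft_loss(real, fake))  # mismatched signals give a positive loss
```

Using several window sizes at once trades off time and frequency resolution, which is presumably why a multi-resolution variant is preferred over a single STFT. A practical implementation would use a batched FFT on tensors instead of the explicit loops shown here.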

Comments:	Submitted to Interspeech 2020
Subjects:	Sound (cs.SD); Audio and Speech Processing (eess.AS)
Cite as:	arXiv:2005.05106 [cs.SD]
	(or arXiv:2005.05106v2 [cs.SD] for this version)
	https://doi.org/10.48550/arXiv.2005.05106

Submission history

From: Geng Yang
[v1] Mon, 11 May 2020 13:48:41 UTC (299 KB)
[v2] Tue, 17 Nov 2020 07:07:23 UTC (1,568 KB)
